The test-taking behavior of some examinees may be so idiosyncratic that
their  test  scores  are  not  comparable  to  the  scores  of  more  typical 
examinees.  Appropriateness  indices  provide  quantitative  measures  of 
response-pattern  atypicality.  An  appropriateness  index  can  be  viewed  as  a 
test  statistic  for  testing  a  null  hypothesis  of  normal  test-taking  behavior 
against  an  alternative  hypothesis  of  atypical  test-taking  behavior.  In  this 
paper  performance  curves  and  the  performance  envelope  are  introduced  as 
devices  for  obtaining  a  least  upper  bound  for  the  power  of  the  most  powerful 
statistical  tests  for  aberrance.  The  performance  envelope  of  a  set  of  tests 
is  the  function  on  [0,1]  whose  value  at  t  is  the  least  upper  bound  of 
the  hit  rates  of  the  tests  when  their  false  positive  rate  is  t  .  The 
performance  curve  of  an  appropriateness  index  is  the  performance  envelope  of 
the  tests  for  aberrance  based  on  the  index.  For  some  types  of  testing 
anomalies  it  is  possible  to  determine  the  performance  envelope  for  the  set 
of  all  statistical  tests  for  aberrance  and  to  identify  a  test  whose 
performance curve is identical to this performance envelope. An algorithm
for computing some of these optimal tests is described, and an example of
its use is presented.
PERFORMANCE ENVELOPES

1.  Introduction

An  examinee's  test-taking  behavior  may  be  so  idiosyncratic  that  his/her 
test  score  is  not  comparable  to  the  scores  of  more  typical  examinees. 

Copying  and  other  forms  of  cheating  could  result  in  a  spuriously  high  score. 
Language  problems,  atypical  education  and  deliberate  failure  could  result  in 
a  spuriously  low  score. 

Some  atypical  examinees  produce  recognizably  unusual  answer  patterns. 
For  example,  in  a  recent  experimental  study  of  deliberate  failure, 
deliberately  failing  examinees  often  chose  obviously  incorrect  options, 
whereas  truly  failing  examinees  rarely  chose  these  options.  Furthermore, 
deliberately  failing  examinees  produced  the  option  response  sequence  ADADAD 
relatively often; however, truly failing examinees very rarely produced this
sequence.

Appropriateness  measurement  attempts  to  detect  faulty  test  scores  by 
recognizing  unusual  answer  patterns.  The  standard  procedure  is  to  formulate 
a  model  for  normal  data  and  a  model  for  aberrant  data.  With  these  models 
the  identification  of  faulty  test  patterns  is  reduced  to  a  hypothesis 
testing  problem. 

To  date,  appropriateness  measurement  studies  have  been  highly 
empirical.  For  example,  to  determine  if  a  form  of  aberrance  can  be 
detected,  several  plausible  detection  procedures  are  tried  out  on  actual  and 
simulation  data  containing  normal  and  aberrant  response  patterns. 

There are a number of questions that cannot be answered satisfactorily
by these empirical methods. To return to the example, if none of the
evaluated detection procedures classifies well, then it cannot be concluded
that the form of inappropriateness could not be detected because some other
procedure may have worked well.

This  paper  introduces  a  general  method  for  obtaining  a  less  ambiguous 
answer  to  the  question  of  whether  a  specific  form  of  aberrance  is 
detectable.  Section  Two  presents  some  motivating  examples  of  applications 
of  our  results.  Section  Three  introduces  terminology  and  some  basic 
concepts.  Section  Four  develops  the  basic  theory  and  relates  it  to  several 
important  measurement  questions.  Section  Five  reviews  two  specific 
applications.  Section  Six  contains  some  mathematical  results  that 
demonstrate  that  the  theory  can  be  implemented  on  currently  available 
computers  for  these  two  applications.  An  algorithm  for  computing  some 
performance  envelopes  is  described  in  Section  Seven.  Section  Eight  provides 
an  illustrative  example. 


In  this  section  some  examples  are  used  to  motivate  and  to  describe  our 
results. 

Example One: Absolute detectability of a specific form of aberrance in
simulation data.

Simulation data sets are commonly used to compare tests for
appropriateness and decide whether a specific form of aberrance can be
detected. These studies use a pair of computer programs, one to simulate
normal  response  data  and  another  to  simulate  patterns  of  right  and  wrong 
answers  from  aberrant  examinees.  Since  an  arbitrarily  large  number  of 
examinees  can  be  simulated,  the  performance  of  any  statistical  test  for 
appropriateness  can  be  accurately  determined.  This  is  done  by  computing  the 
hit  rate  (proportion  of  aberrant  examinees  correctly  classified)  and  false 
positive  rate  (proportion  of  normal  examinees  incorrectly  classified)  with 
large  samples.  If  one  finds  a  test  with  a  high  hit  rate  and  low  false 
positive  rate,  then  one  concludes  the  specified  type  of  aberrance  can  be 
detected,  at  least  in  simulation  data.  (Of  course,  follow-up  studies  with 
actual  data  are  needed  to  verify  the  simulation  results.  However,  some  of 
our  results  are  more  easily  understood  with  simulation  studies.) 

Without loss of generality it can be assumed that the collection of
statistical tests being considered contains at least one test with false
alarm rate equal to $\alpha$ for every $\alpha$ between zero and one. It
makes sense to determine the hit rate $\beta$ of the most powerful test among
those with a given false alarm rate. In fact it is possible to consider the
set of all statistical tests and find a bound at each $\alpha$. For some
important special cases we have developed a usable way to compute a least
upper bound for $\beta$ at each $\alpha$. In fact it is possible to describe
(and compute) a statistical test that actually achieves the maximum.

These results are important because, at least for simulation data, they
yield an absolute measure of the detectability of the specified form of
aberrance. Thus, after applying our methods to a particular
appropriateness  measurement  problem  one  may  be  led  to  one  of  the  following 
conclusions: 

1.  The specified form of aberrance cannot be detected very well by any
appropriateness measurement technique whatsoever; or

2.  There  is  no  point  in  attempting  to  improve  upon  a  developed, 
convenient  appropriateness  measurement  test  because  it  is  only 
slightly  less  powerful  than  all  superior  tests;  or 

3.  There  are  tests  that  are  substantially  more  powerful  than  the  tests 
currently  being  used,  and  significant  gains  in  power  may  be 
obtained  by  revising  appropriateness  measurement  techniques. 

Example  Two:  Choosing  between  dichotomous  and  polychotomous  models. 

Polychotomous analyses are considerably more difficult than analyses of
multiple  choice  data  scored  right  or  wrong.  For  a  specified  form  of 
aberrance,  a  specified  population,  and  a  specified  multiple  choice  test,  can 
one  substantially  improve  appropriateness  measurement  procedures  by 
attending  to  which  wrong  answer  was  chosen?  The  results  in  this  paper  are 
useful  for  answering  at  least  some  forms  of  this  question. 

Using  the  results  in  this  paper,  for  any  false  alarm  rate,  the  hit  rate 
of  the  most  powerful  statistical  test  can  be  computed,  at  least  for  some 
polychotomous  tests.  The  maximum  is  taken  over  all  tests,  including  those 
that are sensitive to which wrong answer was chosen. The maximum can also
be  computed  for  all  tests  that  treat  examinees  with  the  same  pattern  of 
correct  answers  equally,  i.e.  for  dichotomous  statistical  tests.  By 
comparing  maxima  one  can  better  decide  if  polychotomous  analyses  deliver 
enough  additional  power  to  be  worth  developing  and  implementing.  If  the 
maxima  are  close,  then  the  increased  sampling  error  in  the  polychotomous 
model's parameter estimates may offset gains in statistical power.

Example  Three:  Descriptive  models  of  actual  data. 

Since a multiple choice test has only finitely many items, a Markovian
model of high enough order will exactly describe the statistical structure
of  sampled  examinees.  Unless  there  are  complex  interdependencies  between 
nonadjacent  items,  lower  order  Markovian  models  will  adequately  approximate 
the  higher  order,  perfectly  descriptive  model.  There  are  other  families  of 
models  that  also  provide  increasingly  accurate  and,  finally,  a  perfectly 
descriptive  model  (e.g.,  Bahadur,  1968).  The  descriptive  models  generally 
require  very  large  samples  for  parameter  estimation;  however,  in  some 
appropriateness  measurement  tasks,  very  large  samples  of  normal  and  aberrant 
examinee  data  are  available  for  parameter  estimation. 

In a recent study of deliberate test failure, Markovian models of order
$n_T$ and $n_E$ were fitted to large samples of truly failing examinees and
experimental examinees deliberately failing an exam. For each pair of
models, the reasoning used in this paper was applied to compute an optimal
test for inappropriateness in Markovian data. For $n_T=1$ and $n_E=2$ a test
was obtained which, for actual data, was clearly more powerful than all
available alternative appropriateness tests.


To  summarize,  the  results  in  this  paper  can  be  applied  to  a  sequence  of 
models  of  increasing  generality  and  used  to  approximate  a  bound  on  the 
performance  of  an  optimal  test  for  aberrance.  In  the  process  of 
approximating  a  bound,  a  powerful  test  for  aberrance  will  be  constructed. 


The typical problem for appropriateness measurement is to find a
statistical test $\delta$ such that $\delta(u)$ is much more likely to
indicate aberrance when the response vector $u$ has been generated by a
sampled aberrant examinee than when $u$ has been generated by a sampled normal
examinee.  The  key  word  in  this  description  of  appropriateness  measurement 
is  "sampled."  From  a  practical  point  of  view,  it  makes  sense  to  consider 
examinees  as  sampled  since  they  report  for  testing  in  a  haphazard  order  and 
since,  except  when  they  are  cheating,  they  work  independently  of  one 
another.  From  a  theoretical  point  of  view  it  is  desirable  to  regard 
examinees  as  sampled  because  doing  so  leads  to  multinomial  item  response 
models,  simple  (as  opposed  to  composite)  statistical  hypotheses,  and  optimal 
appropriateness  measurement  tests.  A  brief  discussion  of  item  response 
theory  will  clarify  these  points. 

The equations of item response theory are consistent with many
conflicting psychological interpretations. The most useful one for
appropriateness measurement, in our opinion, is to regard each examinee as
having an ability $\theta$ and item scores $u_1, u_2, \ldots, u_n$. The item
scores, according to this view, are random variables because the set of all
examinees is a probability space and not because any examinee's behavior is
uncertain. Similarly, $\theta$ is a random variable only in the sense that
probabilities are assigned to sets of examinees with specified $\theta$
values. "Measurement error" is irrelevant to $\theta$'s status as a random
variable.

Thus for any examinee, say examinee $\omega$, $u_1(\omega)$ and
$\theta(\omega)$ are numbers indicating $\omega$'s response and ability,
respectively. The probability that $u_1$ is zero or that $\theta$ is
negative, on the other hand, are the probabilities assigned to the set of
examinees answering the first item incorrectly or having an ability less
than zero. Thus if examinees are regarded as sampled,
$\mathrm{Prob}\{u_1=0\}$ and $\mathrm{Prob}\{\theta<0\}$ are the
probabilities of sampling an examinee with an incorrect first answer and of
sampling an examinee with negative ability.

For  reference,  the  defining  equations  of  item  response  theory  are 
reproduced  below.  Our  results  follow  from  these  equations  and  do  not  depend 
upon  viewing  subjects  as  deterministic  and  sampled. 

The basic assumption of item response theory, the local independence
assumption, is generally formulated with reference to the item response
functions, $P_1(\cdot), P_2(\cdot), \ldots, P_n(\cdot)$, which give the
conditional probabilities of correct ($u_i=1$) responses at each ability
level. Local independence asserts that

$$\mathrm{Prob}\{u_1=u_1^* \;\&\; u_2=u_2^* \;\&\; \cdots \;\&\; u_n=u_n^* \mid \theta=t\} = \prod_{i=1}^{n} P_i(t)^{u_i^*}\,[1-P_i(t)]^{1-u_i^*} \tag{3.1}$$

where $u_i^*$, the observed item score, is either zero or one.

When the ability density is known or accurately estimated, equation
(3.1) can be used to compute unconditional probabilities. If the ability
random variable has density $f$, then the probability that the response
vector $u$ equals some vector of zeros and ones $u^*$ is obtained by
integrating the likelihood function

$$\mathrm{Prob}\{u=u^*\} = \int \mathrm{Prob}\{u=u^* \mid \theta=t\}\, f(t)\, dt . \tag{3.2}$$
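
As a hedged illustration of equations (3.1) and (3.2), the following Python sketch evaluates the conditional likelihood of a response pattern and integrates it numerically over ability. The two-parameter logistic item response function, the item parameters, the standard normal ability density, and the midpoint-rule grid are all assumptions introduced here for illustration; the paper does not prescribe them.

```python
import math
from itertools import product

def irf(t, a, b):
    """Hypothetical two-parameter logistic item response function P_i(t)."""
    return 1.0 / (1.0 + math.exp(-a * (t - b)))

def conditional_likelihood(u, t, items):
    """Equation (3.1): product of P_i(t)^u_i * [1 - P_i(t)]^(1 - u_i)."""
    like = 1.0
    for u_i, (a, b) in zip(u, items):
        p = irf(t, a, b)
        like *= p if u_i == 1 else 1.0 - p
    return like

def ability_density(t):
    """Assumed ability density f: standard normal."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def marginal_likelihood(u, items, lo=-6.0, hi=6.0, steps=600):
    """Equation (3.2): integrate (3.1) against f(t) with the midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * h
        total += conditional_likelihood(u, t, items) * ability_density(t)
    return total * h

# Illustrative (made-up) discrimination and difficulty parameters for 3 items.
ITEMS = [(1.0, -0.5), (1.2, 0.0), (0.8, 0.7)]

# The 2^n marginal probabilities form a multinomial distribution.
MARGINALS = {u: marginal_likelihood(u, ITEMS) for u in product([0, 1], repeat=3)}
```

With a specified density, the marginal probabilities of the $2^n$ patterns should sum to (approximately) one, which is a convenient check on any implementation.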

In  many  item  response  theory  applications  the  ability  density  is 
ignored.  When  an  ability  density  is  not  specified,  then  the  likelihood 
function (3.1) specifies a continuum of models for normal item response
data, one for each ability. The hypothesis, "$u^*$ has been generated by a
normal  examinee,"  is  composite  in  the  sense  that  there  is  a  different 
probability  that  u=u*  for  each  ability  level  t  .  Such  a  formulation 
leads  to  maximum  likelihood  ratio  tests  such  as  Levine  and  Rubin's  (1979)  LR 
test. 

When  the  ability  density  can  be  specified  or  accurately  estimated,  then 
the  hypothesis  that  u*  has  been  generated  by  a  normal  examinee  is  simple 
in  the  sense  that  formula  (3.2)  gives  a  unique  model  consistent  with  the 
hypothesis.  When  the  alternative  hypothesis  is  also  simple,  then  the 
likelihood  ratio  can  be  used  to  obtain  an  optimal  test  for  appropriateness. 
According to the Neyman-Pearson Lemma (Lehmann, 1959) a statistical test of
the form

$$\delta(u^*) = \begin{cases} \text{``aberrant,''} & \text{if } P_{\text{Aberrant}}(u^*) > \text{constant} \times P_{\text{Normal}}(u^*) \\ \text{``normal,''} & \text{otherwise} \end{cases}$$

has as much or more power for detecting aberrance than all tests with the
same false positive rate.

Note that when the ability density is specified, item response data are
multinomial with $2^n$ categories. Multinomial conceptualizations of the
usual models for aberrant data will be formulated as they are needed. The
key point of this section is that classical statistical results for testing
simple hypotheses can be used without making implausible psychometric
assumptions.


4.  Performance Envelopes


Each of the examples in Section Two was concerned with a set of
statistical  tests.  For  example,  the  second  example  compared  a  set  of  tests 
using  polychotomous  data  with  a  set  of  tests  that  can  be  applied  to 
dichotomous  data.  In  this  section  a  device  for  studying  properties  of  sets 
of  tests,  the  performance  envelope,  is  introduced.  But  first  some  notation 
and  terminology  are  needed. 

The  basic  data  for  appropriateness  measurement  are  the  vectors  of  item 
responses,  here  denoted  by  u  .  A  deterministic  or  nonrandomized 
statistical  test  for  aberrance  is  a  binary  function  of  item  responses  taking 
on  the  values  1  (to  indicate  aberrance)  and  0  (to  indicate  the  absence 
of aberrance). Following Lehmann (1959, p. 60), a pair of tests can be
combined to form a randomized test. If $\delta_1(u)$ and $\delta_2(u)$ are
tests and $0<p<1$ then $d(u;\delta_1,\delta_2,p)$ is used to denote the
randomized test which is $\delta_1(u)$ with probability $p$ and
$\delta_2(u)$ with probability $1-p$.

This paper is exclusively concerned with properties of sets of
statistical tests of aberrance, such as the set of all tests that can be
obtained from a given goodness-of-fit statistic or the set of all statistics
that can be obtained using a given model for test data. The mathematics of
comparing  sets  of  tests  is  simplest  when  these  sets  are  closed  with  respect 
to  routine  operations  and  methods  for  combining  tests. 

If $D$ is a set of statistical tests, then a set $\bar{D}$, possibly equal
to $D$, is defined as the set of tests obtainable from tests in $D$ by
"probability mixtures" (i.e., forming randomized tests from pairs, triples
or larger finite sets of tests), complementation (i.e., forming the test
$1-\delta$ from $\delta$), and considering the trivial test (i.e., the test
$\delta_0(u)=1$, which labels all patterns as aberrant). If no new tests can
be constructed by these routine operations on the tests of $D$, i.e., if
$D=\bar{D}$, then $D$ will be called closed. In most cases of interest (see
below), explicitly expressing all the tests of $\bar{D}$ with formulas
containing only the tests of $D$ is straightforward.

To evaluate the performance of a (randomized or nonrandomized) test for
aberrance, two conditional probabilities are needed. Using the suggestive
terminology of signal detection theory, these are the false positive rate
$\alpha(\delta)$ or probability of misclassifying a randomly sampled normal
examinee and the hit rate $\beta(\delta)$ or probability of correctly
classifying a sampled aberrant examinee. In hypothesis testing terminology,
these are the probability of a type I error and the power of $\delta$,
respectively.

Of course a pair of distributions $P_{\text{Aberrant}}(u)$ and
$P_{\text{Normal}}(u)$ over response vectors must be specified to make the
phrases "randomly sampled normal examinee" and "sampled aberrant examinee"
unambiguous. For each individual application this will be done.

To evaluate the performance of the most powerful tests that can be
obtained from a set of tests $D$, a monotonic real function is introduced,
the performance envelope. If $D$ is a set of statistical tests, then the
performance envelope of $D$ is the function $R=R_D$ defined for
$0\le t\le 1$ by

$$R(t) = \text{least upper bound}\,\{\beta(\delta) : \delta\in\bar{D} \text{ and } \alpha(\delta)=t\},$$

where $\bar{D}$ is the set of tests obtainable from $D$. It is easy to prove
$R$ is a non-decreasing function with values between zero and one.

Two  special  cases,  the  performance  curve  of  a  statistic  and  the 
performance  envelope  for  the  set  of  all  statistical  tests,  will  now  be  used 
to  illustrate  the  definition. 


4.1  The  Performance  Curve  for  a  Statistic 

Let $X$ be a test statistic, i.e., a number-valued function of item
responses such as any of the many goodness-of-fit indicators proposed as an
index of appropriateness. For each "critical score" $c$, two statistical
tests for aberrance can be formulated. One of them,

$$\delta_c(u) = \begin{cases} 1, & \text{if } X(u) \le c \\ 0, & \text{if } X(u) > c \end{cases}$$

treats low values of $X$ as indicative of aberrance, and the other,
$1-\delta_c$, treats high values as indicative of aberrance. The performance
curve for the statistic $X$ is the performance envelope of the set of tests
of the form $\delta_c$ or $1-\delta_c$.

The performance curve of $X$ is important because it shows how well $X$
performs in classifying examinees at each false alarm rate, in the following
sense. Let $D_X$ denote the set of all tests of form $\delta_c$. For each
$t$, there will be a statistical test $\delta$ obtainable from $D_X$ with
false alarm rate equal to $t$ and hit rate equal to $R_X(t)$. This test can
be regarded as most powerful or optimal among the tests obtainable from
$D_X$ with $\alpha=t$ because every other test (with false alarm rate equal
to $t$) will have lower or equal hit rate. The word "obtainable" seems
especially apt here because it is easy to show$^1$ that the optimal test can
always be chosen to be one of the nonrandomized tests or a randomized test
obtained from just two nonrandomized tests.

The performance curve for $X$ differs from the ROC curve for $X$
usually used in appropriateness measurement in that it is continuous and
concave. (Recall that the ROC curve for $X$ is the set of points
$\langle x,y\rangle$ with $x=\alpha(\delta_c)$ and $y=\beta(\delta_c)$ for
some nonrandomized test $\delta_c$ obtainable from $X$. Some authors use
"ROC curve" to denote a curve obtained by fitting a linear or other smooth
curve between points and thus obtaining a continuous, but not necessarily
concave function.)

Since there are only finitely many response patterns, there are only
finitely many points $\langle\alpha(\delta_c),\beta(\delta_c)\rangle$. If
the piecewise linear curve obtained by connecting points corresponding to
consecutive values of $c$ is the graph of a concave function, then this
function is the performance curve for $X$.

Computation of the performance curve for $X$ becomes slightly more
complicated if the ROC has points below the diagonal or if the curve
obtained by connecting consecutive points is not concave. One considers the
finite set of points $\langle\alpha,\beta\rangle$ obtained from all the
nonrandomized tests. One obtains a curve as a piecewise linear function
beginning with the origin (or the point with highest $\beta$ from among all
those with $\alpha=0$ in case there are nontrivial tests with $\alpha=0$) as
the first node. If $\langle\alpha,\beta\rangle$ is the $n$th node of the
piecewise linear function, then the next node is
$\langle\alpha',\beta'\rangle$ where $\alpha'$, $\beta'$ maximize the slope
$(\beta'-\beta)/(\alpha'-\alpha)$ over the subset of the finite set with
$\alpha'>\alpha$.
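
The node-by-node construction just described can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' program: the function names are hypothetical, the next node is taken to be the point of steepest slope from the current node (ties broken by the larger false alarm rate), and the points contributed by complementation and by the trivial test are added explicitly as an assumption about how the closure operations enter.

```python
def performance_curve_nodes(roc_points):
    """Nodes of the concave piecewise linear curve over (alpha, beta) points."""
    pts = set(roc_points)
    pts.add((1.0, 1.0))  # trivial test: every pattern labeled aberrant
    # Complementation maps a test with rates (a, b) to one with (1 - a, 1 - b).
    pts |= {(1.0 - a, 1.0 - b) for a, b in pts}
    # First node: highest hit rate among tests with zero false alarm rate.
    start_beta = max(b for a, b in pts if a == 0.0)
    nodes = [(0.0, start_beta)]
    while nodes[-1][0] < 1.0:
        a0, b0 = nodes[-1]
        candidates = [(a, b) for a, b in pts if a > a0]
        # Next node: steepest slope from the current node.
        nodes.append(max(candidates,
                         key=lambda p: ((p[1] - b0) / (p[0] - a0), p[0])))
    return nodes

def curve_value(nodes, t):
    """Hit rate at false alarm rate t; interpolation mixes two adjacent tests."""
    for (a0, b0), (a1, b1) in zip(nodes, nodes[1:]):
        if a0 <= t <= a1:
            return b0 + (b1 - b0) * (t - a0) / (a1 - a0)
    raise ValueError("t must lie in [0, 1]")

# Made-up ROC points; (0.3, 0.55) lies below the chord joining its neighbors.
ROC = [(0.1, 0.5), (0.3, 0.55), (0.5, 0.9)]
NODES = performance_curve_nodes(ROC)
```

In this made-up example the point at false alarm rate 0.3 is skipped, so a randomized test mixing the tests at 0.1 and 0.5 has a higher hit rate at 0.3 than the nonrandomized test located there.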

The performance curve is preferable to the ROC for comparing a pair
of statistics $X$ and $Y$ for two reasons. First, for each $t$ either
$R_X(t)\ge R_Y(t)$ or $R_X(t)<R_Y(t)$ so the choice between $X$ and $Y$ is
clear when a false alarm rate $t$ is desired, even when there is no
nonrandomized test with false alarm rate $t$. Second, the performance curve
for a statistic $X$ is concave, but the ROC curve need not be. Thus for some
range of possible false alarm rates $\alpha$, say between $t-\epsilon$ and
$t+\epsilon$ for $\epsilon>0$, a randomized test can have higher hit rate
than all the nonrandomized tests $\delta$ with
$t-\epsilon<\alpha(\delta)<t+\epsilon$. Consequently comparing ROC curves
can lead to the wrong choice between $X$ and $Y$ to use for constructing a
statistical test with false alarm rate near $t$.

In concluding this subsection we wish to point out that sets of tests
like $D_X$ are much more general than seems to be realized. A set $D$ of
nonrandomized statistical tests for aberrance has nested critical regions if
for any $\delta_1$ and $\delta_2$ in $D$ either $\delta_1\le\delta_2$ or
$\delta_2\le\delta_1$. In other words, $\delta_1(u^*)\le\delta_2(u^*)$ for
every response pattern $u^*$ or $\delta_2(u^*)\le\delta_1(u^*)$ for every
response pattern $u^*$. Using the fact that there are only finitely many
possible response patterns it can be shown that if $D$ has nested critical
regions there is a statistic $X$ such that $D=D_X$ and the performance
envelope for $D$ is the performance curve for $X$.

This  fact  is  important  because  it  shows  the  generality  of  the  approach 
to  appropriateness  measurement  we  use:  classifying  examinees  by  using  an 
"appropriateness  index"  or  real  valued  function  of  item  scores  and  a  range 
of  cutting  scores.  Any  set  of  tests  with  nested  critical  regions  can  be 
obtained  with  this  approach. 


4.2  The Performance Envelope for the Set of All Statistical Tests


At  least  in  some  situations  it  is  practical  to  consider  the  performance 
envelope  for  the  set  of  all  statistical  tests  for  aberrance,  and  this  leads 
to  a  second  illustration  of  performance  envelopes. 

As  noted  in  Section  Three  when  the  ability  distribution  is  specified 
formula  (3.2)  defines  a  simple  multinomial  model  for  item  response  patterns. 
Plausible, simple multinomial models (e.g., the spurious high and spurious
low models of Sections Five and Six) are appropriate for some important
forms of aberrant data. Thus in principle the likelihood ratio statistic

$$X(u) = P_{\text{Aberrant}}(u) / P_{\text{Normal}}(u)$$

can be defined. In Section Seven an algorithm for calculating $X$ is
described.

A basic result for this research is that the performance envelope for
the set of all statistical tests for aberrance is the performance curve for
the likelihood ratio statistic. In other words, for any statistical test
$\delta$, there is a test obtainable from the likelihood ratio statistic with
false alarm rate equal to $\alpha(\delta)$ and hit rate at least as large as
$\beta(\delta)$. This fundamental result is an immediate consequence of the
Neyman-Pearson Lemma (Lehmann, 1959, p. 63).
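
The result can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not part of the paper: the two distributions over four response patterns are invented, every pattern is assumed to have positive probability under the normal model, and the function names are hypothetical. Patterns are ordered by likelihood ratio, cumulative false alarm and hit rates give the nodes of the likelihood ratio performance curve, and every nonrandomized test (every critical region) is compared against that curve.

```python
from itertools import combinations

def lr_envelope_nodes(p_normal, p_aberrant):
    """Nodes of the likelihood ratio performance curve (needs p_normal > 0)."""
    patterns = sorted(p_normal,
                      key=lambda u: p_aberrant[u] / p_normal[u],
                      reverse=True)
    nodes = [(0.0, 0.0)]
    alpha = beta = 0.0
    for u in patterns:  # grow the critical region one pattern at a time
        alpha += p_normal[u]
        beta += p_aberrant[u]
        nodes.append((alpha, beta))
    return nodes

def envelope_at(nodes, t):
    """Interpolate between nodes, i.e. randomize between two adjacent tests."""
    for (a0, b0), (a1, b1) in zip(nodes, nodes[1:]):
        if t <= a1:
            return b0 + (b1 - b0) * (t - a0) / (a1 - a0)
    return nodes[-1][1]

def rates(critical_region, p_normal, p_aberrant):
    """False alarm rate and hit rate of the test with this critical region."""
    alpha = sum(p_normal[u] for u in critical_region)
    beta = sum(p_aberrant[u] for u in critical_region)
    return alpha, beta

# Invented distributions over the four response patterns of a two-item test.
P_NORMAL = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}
P_ABERRANT = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
ENVELOPE = lr_envelope_nodes(P_NORMAL, P_ABERRANT)
```

Enumerating all critical regions is feasible only for very short tests; it is used here solely to make the optimality statement concrete.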


Spurious  Scores  and  the  Computation  of  Envelopes 


In  the  remainder  of  this  paper  an  algorithm  for  computing  performance 
envelopes  for  some  important  models  is  described  and  illustrated.  The 
spurious  score  models  and  tampering  manipulations  have  provided  reference 
experiments  for  comparing  appropriateness  measurement  results  in  several 
laboratories  (Drasgow,  Levine,  &  Williams,  1985;  Levine  &  Rubin,  1979; 
Parsons,  1983;  Rudner,  1983).  Spurious  score  model  and  tampering 
experiments  are  also  important  because  they  can  be  used  to  predict  the 
performance  of  appropriateness  measurement  procedures  in  various  actual 
situations  without  collecting  additional  data. 

The 10% spurious high tampering manipulation is an operation on an
actual or simulated examinee's answer sheet that changes up to 10% of the
examinee's item scores. In this manipulation 10% of the items are sampled
without replacement. Incorrect answers are changed to correct answers, and
correct answers are left unchanged.
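
A minimal Python sketch of this manipulation is given below. The function name, the `fraction` and `rng` parameters, and the rounding of the number of sampled items are assumptions made for illustration.

```python
import random

def spurious_high_tamper(item_scores, fraction=0.10, rng=random):
    """Sample a fraction of items without replacement and score them correct."""
    n = len(item_scores)
    k = round(fraction * n)  # number of tampered items (rounding is assumed)
    tampered = list(item_scores)
    for i in rng.sample(range(n), k):
        tampered[i] = 1  # incorrect becomes correct; correct stays correct
    return tampered

# Example: a 20-item answer sheet with every item initially incorrect.
EXAMPLE = spurious_high_tamper([0] * 20, rng=random.Random(0))
```

Because sampled items that were already correct are left unchanged, the number of item scores actually altered is at most 10% of the test length.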

Data conform to a 10% spurious high model if the likelihood function
for each item response pattern is the likelihood function for a response
pattern generated by a normal examinee and then modified by 10% spurious
high tampering. An explicit formula is given later in this section.

The spurious high model and tampering procedures were formulated after
considering a low ability examinee copying from a much brighter neighbor
when the proctor happened to be distracted. Of course, some copiers will
risk copying on 10% of the items and others on 20% or 5% of the items.
However, after a distribution on the percentages is specified, results from
studies in which the percent tampering has been constant can be combined to
approximate performance in the more realistic situation. The studies in
which percent tampering is constant are basic because they permit the
psychometrician to predict for an arbitrary percent copying distribution.

According to the 10% spurious high model, the likelihood of an item
response pattern $u^* = (u_1^*, u_2^*, \ldots, u_n^*)$ is a sum over
$\binom{n}{n/10}$ terms

$$\binom{n}{n/10}^{-1} \sum_{S} \prod_{i \in S} u_i^* \prod_{i \notin S} P_i(t)^{u_i^*}\, Q_i(t)^{1-u_i^*}$$

where $Q_i(t)=1-P_i(t)$ and $S$ ranges over subsets of the first $n$
positive integers having exactly $n/10$ elements. Direct computation of
likelihoods is impractical because for $n=90$, $n/10=9$ there are more than
$10^{11}$ terms.
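
For very short tests the sum over subsets can be evaluated directly, which makes the formula concrete and shows why the direct approach does not scale. The Python sketch below is a brute-force illustration only; it assumes the normalizing factor is the reciprocal of the number of subsets (each subset being equally likely under sampling without replacement), takes the values $P_i(t)$ at a fixed ability as given numbers, and uses a hypothetical function name.

```python
from itertools import combinations, product
from math import comb, prod

def spurious_high_likelihood(u, p, k):
    """Brute-force likelihood of pattern u given P_i(t) values p and k tampered items."""
    n = len(u)
    total = 0.0
    for subset in combinations(range(n), k):
        chosen = set(subset)
        if any(u[i] == 0 for i in chosen):
            continue  # a tampered item is always scored correct
        # Untampered items follow the normal model at the fixed ability.
        total += prod(p[i] if u[i] == 1 else 1.0 - p[i]
                      for i in range(n) if i not in chosen)
    return total / comb(n, k)  # assumed: subsets are equally likely

# Made-up P_i(t) values for a five-item test with k = 2 tampered items.
P_AT_T = [0.2, 0.5, 0.7, 0.4, 0.9]
TOTAL = sum(spurious_high_likelihood(u, P_AT_T, 2)
            for u in product([0, 1], repeat=5))

# Number of subsets for the case cited in the text (n = 90, n/10 = 9).
N_TERMS = comb(90, 9)
```

The likelihoods of all patterns should sum to one, and any pattern with fewer correct answers than tampered items has likelihood zero.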

The 10% spurious low tampering manipulation is a procedure that also
revises normal item response patterns. Exactly 10% of the examinee's item
responses are sampled. For each sampled item response a random response is
generated. If the generated response agrees with the examinee's response,
no change is made. Otherwise, the examinee's item response is changed to
the generated response. Spurious low score models are defined analogously
to spurious high score models by referring to a two stage procedure; the
first stage conforms to a model for normal responding, and the second
stage modifies the patterns generated in the first stage by a spurious low
tampering manipulation. The likelihood function for this model is

$$\binom{n}{n/10}^{-1} \sum_{S} \prod_{i \in S} A_i^{u_i^*} (1-A_i)^{1-u_i^*} \prod_{i \notin S} P_i(t)^{u_i^*}\, Q_i(t)^{1-u_i^*}$$

where the summation is over subsets of the first $n$ positive integers
having exactly $n/10$ elements and where the $A_i$ are taken to be one over
the number of options for item $i$.

The spuriously high model models copiers and examinees with knowledge
of a test's answer key for some proportion of the test items. The
spuriously low model models random responding to some proportion of test
items.

Spurious  low  aberrance  can  also  be  interpreted  in  meaningful  ways. 
Consider,  for  example,  the  assessment  of  children  for  possible  assignment  to 
special  education  programs.  There  are  serious  concerns  about  the  meanings 
of  test  scores  when  tests  standardized  on  mainstream  samples  are 
administered  to  cultural  minorities.  This  is  particularly  important  when  a 
child  is  tested  in  a  second  language  in  which  he  or  she  may  not  be  fluent. 
His  or  her  responses  to  some  linguistically  demanding  items  may  be  nearly 
random.  The  seriousness  of  this  problem  is  underscored  by  the  fact  that 
"intelligence"  tests  cannot  be  used  in  California  when  assessing  minority 
children  for  special  education  (see  Hulin,  et  al.,  1983,  Chapter  9).  As 
before,  results  with  fixed  percentages  of  tampering  can  be  combined  to 
predict  for  situations  in  which  the  number  of  spurious  items  has  an 
arbitrary  specified  distribution. 


An  Algorithm  for  Calculating  Likelihoods 


The  major  obstacle  to  computing  performance  envelopes  for  the  spurious 
models  is  the  calculation  of  likelihoods.  An  algorithm  for  computing  these 
likelihoods  can  be  obtained  from  classical  results  on  symmetric  functions. 

In  this  section  a  highly  intuitive  derivation  not  requiring  symmetric 
functions  is  given.  The  intuitive  derivation  has  the  advantage  of  showing 
that  the  algorithm  can  be  used  to  study  a  large  variety  of  appropriateness 
problems.  It  appears  useful  for  modeling  tests  in  which  items  differ  in  the 
degree  to  which  they  elicit  an  aberrant  response  and  in  which  there  are 
complex  interactions  between  ability  and  tendency  to  cheat  or  otherwise 
perform  aberrantly. 

Consider an experiment in which on each trial an examinee is presented
an item. Suppose on trial $i$ the examinee performs normally with
probability $1-p_i(t,s)$ but responds aberrantly with probability
$p_i(t,s)$, so that the probability of a correct response can be written

$$
[1-p_i(t,s)]\,P_i(t) + p_i(t,s)\,A_i(t) .
$$


For example, an examinee with an imperfect "crib sheet," ability $t$ and
inclination to cheat $s$ risks using the crib sheet to answer item $i$ with
probability $p_i(t,s)$ and then answers correctly with probability $A_i(t)$,
or chooses to ignore the crib sheet with probability $1-p_i(t,s)$ and then
answers correctly with probability $P_i(t)$. In this interpretation of the
equations, $A_i(t) = 1$ if the crib sheet has the correct answer, zero if the
crib sheet has the wrong answer and $P_i(t)$ if the crib sheet has no
information on the item. In our analyses of spurious high and low models,
$A_i(t)$ will be 1 or the reciprocal of the number of options, and $p$ will
also be independent of $i$, $t$ and $s$.

If the appropriate independence assumptions are made, the likelihood
function for a response pattern $u^*$ will be

$$
\ell(u^*;t,s) = \prod_{i=1}^{n}
\{[1-p_i(t,s)]P_i(t) + p_i(t,s)A_i(t)\}^{u_i^*} \times
\{[1-p_i(t,s)]Q_i(t) + p_i(t,s)[1-A_i(t)]\}^{1-u_i^*} .
$$

If $p_i(t,0) = 0$, then $\ell(u^*;t,0)$ is the likelihood function for normal
examinees.
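The product above translates almost line for line into code. The sketch below is illustrative only: the function name and the list-based arguments are assumptions, with `P[i]`, `A[i]` and `p[i]` holding $P_i(t)$, $A_i(t)$ and $p_i(t,s)$ already evaluated at a fixed $t$ and $s$.

```python
def mixture_likelihood(u, P, A, p):
    """Likelihood of pattern u when item i is answered aberrantly with
    probability p[i] (correct with probability A[i]) and normally otherwise
    (correct with probability P[i])."""
    likelihood = 1.0
    for u_i, P_i, A_i, p_i in zip(u, P, A, p):
        prob_correct = (1.0 - p_i) * P_i + p_i * A_i
        # 1 - prob_correct equals (1 - p_i) * Q_i + p_i * (1 - A_i).
        likelihood *= prob_correct if u_i == 1 else 1.0 - prob_correct
    return likelihood

# Hypothetical three-item example; setting every p[i] to zero recovers the
# likelihood for normal examinees.
P = [0.8, 0.6, 0.7]
A = [0.2, 0.2, 0.2]
aberrant_value = mixture_likelihood([1, 0, 1], P, A, [0.5, 0.5, 0.5])
normal_value = mixture_likelihood([1, 0, 1], P, A, [0.0, 0.0, 0.0])
```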

In many analyses it is necessary to keep track of the number of items
on which cheating or aberrant responding took place. To this end an
indeterminate $r$ is introduced and a "probability generating function" $G$
is defined by

$$
G(u^*;r,t,s) = \prod_{i=1}^{n}
\{[1-p_i(t,s)]P_i(t) + r\,p_i(t,s)A_i(t)\}^{u_i^*}
\{[1-p_i(t,s)]Q_i(t) + r\,p_i(t,s)[1-A_i(t)]\}^{1-u_i^*} .
$$

If $G$ is written as a polynomial in $r$, then the constant term,
$G(u^*;0,t,s)$, is the probability of observing $u^*$ from an examinee making
no aberrant responses. The coefficient of the linear term,
$\partial G/\partial r$ evaluated at $r=0$, is the probability of observing
$u^*$ from examinees making exactly one aberrant response. More generally,
the coefficient of $r^k$ (i.e. $(1/k!)\,\partial^k G/\partial r^k$ evaluated
at $r=0$) will be the probability of pattern $u^*$ with exactly $k$ aberrant
responses. If $p_i(t,s) = .5$ for all $i$ then the coefficient of $r^k$ is
$.5^n$ times the sum of the products having exactly $k$ factors selected from
the set $\{A_i(t)^{u_i^*}[1-A_i(t)]^{1-u_i^*} : i = 1, \ldots, n\}$ and $n-k$
factors from $\{P_i(t)^{u_i^*}Q_i(t)^{1-u_i^*} : i = 1, \ldots, n\}$, with
each item contributing exactly one factor. In other words, the coefficient of
$r^k$ is $\binom{n}{k}$ (i.e., the number of ways to select $k$ items from
$n$) times $.5^n$ times the probability of observing $u^*$ when exactly $k$
responses are aberrant and all the subsets of $k$ responses are equally
likely.
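One way to see the generating function at work is to expand it numerically as a polynomial in $r$. The following sketch does this by multiplying in one linear factor per item; the function name and the example values are hypothetical, and the inputs are assumed to be already evaluated at a fixed $t$ and $s$.

```python
def generating_function_coeffs(u, P, A, p):
    """Coefficients of G as a polynomial in r: entry k is the probability of
    observing u together with exactly k aberrant responses."""
    coeffs = [1.0]  # the constant polynomial 1
    for u_i, P_i, A_i, p_i in zip(u, P, A, p):
        if u_i == 1:
            normal_part, aberrant_part = (1.0 - p_i) * P_i, p_i * A_i
        else:
            normal_part, aberrant_part = (1.0 - p_i) * (1.0 - P_i), p_i * (1.0 - A_i)
        # Multiply the running polynomial by (normal_part + r * aberrant_part).
        updated = [0.0] * (len(coeffs) + 1)
        for degree, c in enumerate(coeffs):
            updated[degree] += normal_part * c
            updated[degree + 1] += aberrant_part * c
        coeffs = updated
    return coeffs

# Hypothetical three-item example with p_i = .5 for every item.
coeffs = generating_function_coeffs([1, 0, 1], [0.8, 0.6, 0.7], [0.2] * 3, [0.5] * 3)
```

Evaluating the polynomial at $r = 1$ (the sum of the coefficients) returns the likelihood of the pattern without regard to the number of aberrant responses.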

To simplify the evaluation of these coefficients $G$ is divided by
the constant term to obtain

$$
\frac{G(u^*;r,t,s)}{G(u^*;0,t,s)} = \prod_{i=1}^{n} [1 + rB_i] ,
$$

where

$$
B_i =
\begin{cases}
\dfrac{p_i(t,s)}{1-p_i(t,s)} \times \dfrac{A_i(t)}{P_i(t)} , & \text{if } u_i^* = 1 , \\[2ex]
\dfrac{p_i(t,s)}{1-p_i(t,s)} \times \dfrac{1-A_i(t)}{Q_i(t)} , & \text{if } u_i^* = 0 .
\end{cases}
$$

Note that if $p_i(t,s)$ equals .5 for each item $i$, the terms in $p$ drop
out, and the coefficient of $r^k$ in $\prod [1+rB_i]$ is
$\binom{n}{k}\,\ell(u^*;t,0)^{-1}$ times the probability of a
$(k/n) \times 100$ percent spurious (high/low) examinee producing pattern
$u^*$, provided the $A_i(t)$ terms are appropriately chosen. Here
$\ell(u^*;t,0)$ is the likelihood of $u^*$ for normal examinees.

This formula permits enormous computational savings because the
coefficients of the powers of $r$ can be computed recursively with
relatively few operations. Since

$$
\prod_{i=1}^{m} (1+rB_i) = [1+rB_m] \prod_{i=1}^{m-1} (1+rB_i)
= \prod_{i=1}^{m-1} (1+rB_i) + rB_m \prod_{i=1}^{m-1} (1+rB_i) ,
$$

it is clear that the coefficients $C_{r,m}$ in the partial products
$\prod_{i=1}^{m} (1+rB_i)$ (the first subscript giving the power of the
indeterminate that the coefficient multiplies) satisfy the recursion

$$
C_{r,m+1} = C_{r,m} + B_{m+1}\,C_{r-1,m} ,
$$

where $C_{0,m} = 1$ and $C_{r,m} = 0$ for $r > m$.
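The recursion can be sketched in a few lines. This is a minimal illustration rather than the authors' program: the function name is hypothetical, and the array is updated from the highest power downward so that each step still sees the previous revision's value of the lower coefficient.

```python
def product_coefficients(B, k_max):
    """Coefficients C[0..k_max] of the product of (1 + r * B_i), built one
    factor at a time with C[r] <- C[r] + B_m * C[r-1]."""
    C = [1.0] + [0.0] * k_max
    for b in B:
        # Descending order keeps C[r - 1] at its value from the previous
        # revision when C[r] is updated.
        for r in range(k_max, 0, -1):
            C[r] += b * C[r - 1]
    return C

# (1 + r)(1 + 2r)(1 + 3r)(1 + 4r) = 1 + 10r + 35r^2 + 50r^3 + 24r^4
coefficients = product_coefficients([1.0, 2.0, 3.0, 4.0], 3)
```

Only the coefficients up to the power of interest need to be stored, so the work is proportional to the number of items times that power.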

To illustrate the use of this formula consider 10.6% spurious low
tampering on an 85 item test. The $P_i$ are specified as three parameter
logistic functions and $A_i(t) = .2$ was used to model a random choice from
the five multiple choice options. The aberrant items were obtained by
sampling 9 items from all 85 without replacement. The likelihood of a
particular pattern $u^*$ being sampled from among all examinees having
parameters $t,s$ and producing exactly 9 aberrant responses is the sum of
$\binom{85}{9} \approx 4.1 \times 10^{11}$ products, each of which has many
factors. There is one product for each way to select 9 responses from 85.
Thus a direct computation requires on the order of $85 \times 10^{11}$
multiplications at each ability level.

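By using the recursion the number of multiplications can be greatly
reduced. The desired probability is equal to

$$
\binom{85}{9}^{-1} \ell(u^*;t,0) \times C_{9,85} ,
$$

where $\ell(u^*;t,0)$ is the likelihood of $u^*$ for normal examinees, where
$C_{9,85}$ is the coefficient of $r^9$ in the polynomial

$$
\prod_{i=1}^{85} [1 + rB_i] ,
$$

and where the $B_i$'s are computed by setting $p_i(t,s) = 1/2$ and
$A_i(t) = .2$. To calculate $C_{9,85}$ a 10 entry array is revised 85 times.
Initially $C_0$ is set equal to 1, and the remaining $C$'s,
$C_1, C_2, \ldots, C_9$, are set equal to zero. The $m$th revision replaces
$C_r$ by the current value of $C_r$ plus $B_m$ times the current value of
$C_{r-1}$ for $r = 1, 2, \ldots, 9$, where "current" refers to the values
held before the $m$th revision (updating in the order $r = 9, 8, \ldots, 1$
ensures this when a single array is used). $C_0$ is left equal to 1. Thus the
eighty five revisions require less than 850 multiplications.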
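The operation count in this illustration can be checked with a short sketch. The item probabilities and the response pattern below are hypothetical stand-ins (the actual item parameters are not reproduced here), and the function name is an assumption; only the test length, the number of aberrant items and $A_i(t) = .2$ are taken from the text.

```python
from math import comb, prod

def coefficient_with_count(B, k):
    """Return the coefficient of r^k in the product of (1 + r * B_i) and the
    number of multiplications spent revising the (k + 1)-entry array."""
    C = [1.0] + [0.0] * k
    multiplications = 0
    for b in B:
        for r in range(k, 0, -1):
            C[r] += b * C[r - 1]
            multiplications += 1
    return C[k], multiplications

n, k, A = 85, 9, 0.2
# Hypothetical values of P_i(t) at one ability level and a hypothetical pattern.
P = [0.30 + 0.60 * i / (n - 1) for i in range(n)]
u = [0 if i % 3 == 0 else 1 for i in range(n)]

# With p_i(t,s) = 1/2 the factor p/(1-p) equals 1 and drops out of B_i.
B = [A / P_i if u_i == 1 else (1.0 - A) / (1.0 - P_i) for u_i, P_i in zip(u, P)]
normal_likelihood = prod(P_i if u_i == 1 else 1.0 - P_i for u_i, P_i in zip(u, P))

c_9_85, multiplications = coefficient_with_count(B, k)
probability = normal_likelihood * c_9_85 / comb(n, k)
```

Under these assumptions the array revisions use 85 x 9 = 765 multiplications, in line with the bound of 850 quoted above, whereas the direct sum would range over `comb(85, 9)` subsets.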

In this section an algorithm is described for computing performance
envelopes  for  the  set  of  all  statistical  tests  in  the  important  situations 
in  which 

1.  item  response  functions  are  specified 

2.  ability  distributions  are  specified  for  both  the  normal  and 
aberrant  populations,  and 

3.  data  from  aberrant  examinees  conform  to  a  spurious  high  model  or  a 
spurious  low  model. 

Each  of  these  conditions  is  commented  upon  separately  below. 

1.  Specified  item  response  functions  certainly  pose  no  problem  for  the 
reference  simulation  studies  that  are  commonly  performed.  A  variety  of  item 
response  function  estimation  procedures  are  available  for  actual  data  (Bock 
&  Aitkin,  1981;  Lord,  1968;  Samejima,  1981).  Levine  and  Drasgow  (1982) 
reported  experiments  for  measuring  the  effects  of  using  estimated  item 
response  functions  in  appropriateness  measurement  studies  with  actual  and 
simulated  data  and  in  which  the  item  parameter  estimation  sample  contains  a 
specified proportion of unidentified aberrant examinees. They found that,
with sufficiently large item parameter estimation samples and with parameters
estimated by LOGIST (Wood, Wingersky, & Lord, 1976) from samples with and
without aberrants, the index $\ell_0$ performed about as well with estimated
item parameters as with correct item parameters. Portions of their studies are
currently  being  repeated  to  gauge  the  effects  of  using  estimated  parameters 
on  performance  envelopes. 

2.  Ability distribution estimation programs are available (e.g.
Levine, 1984, 1985; Mislevy, 1984) for dealing with normal populations.
Levine has shown that his method is strongly consistent and asymptotically
efficient (1985). Much less is known about estimating ability distributions
for aberrant examinees. Furthermore, the aberrant sample will generally be
quite small. However, sometimes it is acceptable to assume that ability has
the same distribution in both populations; other times the ability
distribution can be specified by a priori considerations. For example, one
of the hardest and most important tasks for appropriateness measurement is
to identify spuriously high cheaters with ability slightly below the minimum
required to qualify for military technical training. To measure performance
in this worst case, the aberrant distribution is assumed to be uniform over a
short interval below the critical ability.

3.  In the example presented in the next section 10% spurious low
aberrance is studied. Essentially the same algorithm is used for spurious
high aberrance. We feel that the constant percentage spuriousness condition
is especially important because, as noted in Section Five, these studies are
used as reference experiments and because the constant percentage studies
can be easily combined to predict performance without collecting new data
after virtually any distribution over percent spuriousness has been chosen
or estimated. However, by appropriately specifying the $p_i(t,s)$ and
$A_i(t,s)$ in Section Six, item effects and complex interactions between
ability and "inclination towards aberrance" can be modelled. For example
two  values  of  could  be  used  to  model  the  fact  that  only  some  of  the 

items were available to a coaching school or a dishonest military recruiter.
The $s$ variable could be used as an index when modelling second language
problems in a population consisting of several distinct linguistic groups,
say Hispanics, Mandarin speaking Chinese Americans and examinees speaking
English only. In any event the basic algorithm suffices for a variety of
optimal appropriateness measurement problems.

To obtain the performance envelope for the set of all statistics, the
performance curve for the likelihood ratio statistic $\lambda$,

$$
\lambda(u^*) = P_{\mathrm{Aberrant}}(u = u^*) \, / \, P_{\mathrm{Normal}}(u = u^*) ,
$$

is computed. To approximate the $\lambda$ performance curve the sample
$\lambda$ ROC curve is calculated for large samples of normal and aberrant
examinees. By using the fact that $\lambda(u)$ assumes only finitely many
values it is easy to show that with probability one the piecewise linear
function connecting consecutive points on the sample ROC converges to the
performance curve for $\lambda$.
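A sample ROC curve of this kind can be sketched as follows. The function names are hypothetical, and the sketch assumes that a pattern is classified as aberrant when its statistic is at least as large as the cutoff; the statistic values in the example are invented for illustration.

```python
def sample_roc(normal_stats, aberrant_stats):
    """ROC points (false positive rate, hit rate), one per distinct cutoff,
    ordered from the strictest cutoff to the most lenient."""
    cutoffs = sorted(set(normal_stats) | set(aberrant_stats), reverse=True)
    points = [(0.0, 0.0)]
    for c in cutoffs:
        false_positive = sum(x >= c for x in normal_stats) / len(normal_stats)
        hit = sum(x >= c for x in aberrant_stats) / len(aberrant_stats)
        points.append((false_positive, hit))
    return points

def hit_rate_at(points, t):
    """Largest hit rate attained at false positive rate t on the piecewise
    linear curve through consecutive ROC points."""
    best = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= t <= x1:
            if x1 == x0:
                y = max(y0, y1)
            else:
                y = y0 + (y1 - y0) * (t - x0) / (x1 - x0)
            best = max(best, y)
    return best

# Invented likelihood ratio values for four normal and four aberrant patterns.
points = sample_roc([0.1, 0.2, 0.3, 0.4], [0.35, 0.5, 0.6, 0.25])
rate = hit_rate_at(points, 0.25)
```

Points on the linear pieces between two sample points correspond to randomizing between the two neighbouring cutoffs.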

To calculate the likelihood ratio $\lambda(u^*)$ the numerator and
denominator are calculated separately. For normal examinees, the likelihood
function is calculated by substituting the specified item parameters in

$$
P_i(t) = c_i + \frac{1 - c_i}{1 + \exp[-a_i(t - b_i)]} ,
$$

and numerically integrating as in equation (3.2) to obtain

$$
P_{\mathrm{Normal}}(u^*) = \int \prod_{i=1}^{n}
[P_i(t)]^{u_i^*} [1 - P_i(t)]^{1-u_i^*} f(t) \, dt .
$$

$P_{\mathrm{Aberrant}}$ is also an integrated likelihood function. The
computation of the integrand is discussed later in this section after $f$ and
the integration are described.
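The two formulas above can be combined in a short numerical sketch. Everything named here is hypothetical: the item parameters are invented, the density $f$ is taken to be standard normal, and the integration range of [-4, 4] with a step of .10 is an assumption made only for the illustration.

```python
from math import exp, pi, sqrt

def three_pl(t, a, b, c):
    """Three parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + exp(-a * (t - b)))

def normal_density(t):
    return exp(-0.5 * t * t) / sqrt(2.0 * pi)

def pattern_likelihood(u, items, t):
    """Likelihood of pattern u at ability t for normal examinees."""
    likelihood = 1.0
    for u_i, (a, b, c) in zip(u, items):
        P = three_pl(t, a, b, c)
        likelihood *= P if u_i == 1 else 1.0 - P
    return likelihood

def simpson(g, lo, hi, n_intervals):
    """Composite Simpson's rule; n_intervals must be even."""
    h = (hi - lo) / n_intervals
    total = g(lo) + g(hi)
    for j in range(1, n_intervals):
        total += (4 if j % 2 else 2) * g(lo + j * h)
    return total * h / 3.0

def p_normal(u, items, lo=-4.0, hi=4.0, step=0.10):
    n_intervals = round((hi - lo) / step)
    if n_intervals % 2:
        n_intervals += 1
    return simpson(lambda t: pattern_likelihood(u, items, t) * normal_density(t),
                   lo, hi, n_intervals)

# Hypothetical (a, b, c) parameters for a three-item test.
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.7, 0.2)]
value = p_normal([1, 1, 0], items)
```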

In our research to date, we have taken the density $f$ to be normal
(0,1) or normal (0,1) truncated to the interval [-2.05,2.05] when
generating simulation data and evaluating the integrals to compute
$P_{\mathrm{Normal}}$ and $P_{\mathrm{Aberrant}}$. Although normality is not
required, our current algorithm does take advantage of some of its
properties. In particular, it uses the facts that the normal density is
continuous and "flat" relative to the likelihood functions for abilities less
than 2.05 in absolute value. The normal density varies from about .05 to .399
on the interval [-2.05,2.05],

but  the  likelihood  function's  maximum  is  usually  1 01 ^  to  10^  times  as 

large as its minimum on the interval for the 85 to 95 item tests we have
studied. Consequently, portions of the interval [-2.05,2.05] can often be
ignored with little loss of accuracy when computing probabilities.

The integrals in $P_{\mathrm{Normal}}$ and $P_{\mathrm{Aberrant}}$ are being
evaluated by Simpson's rule. For both probabilities we obtained four to five
digit accuracy when the distance $\Delta$ between quadrature points was .20
and five to six digit accuracy for $\Delta = .10$. We have generally used
$\Delta = .10$ in our calculations because it seemed to provide the best
trade-off between numerical accuracy and computing expense.

The number of function evaluations can be reduced by first computing
the maximum likelihood estimate $\hat\theta$ of ability given $u^*$. Let $g$
denote the function to be integrated. Then $g$ can be evaluated at points
$\hat\theta - i\Delta$, $i = 1, 2, \ldots, m_1$, spaced $\Delta$ apart, until
$g(\hat\theta - i\Delta)$ becomes very small. The algorithm requires
$g(\hat\theta - i\Delta)$ to be less than $10^{-4}$ times as large as
$g(\hat\theta)$. Similarly, $g$ can be evaluated at points
$\hat\theta + i\Delta$, $i = 1, 2, \ldots, m_2$. If the total number of
function evaluations is odd, then Simpson's rule can be applied immediately.
When the total is even, one more function evaluation should be obtained
before application of Simpson's rule. We have found that the number of
function evaluations can often be reduced by 50% for $\Delta = .10$ by this
rule.
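The stopping rule just described might be sketched as follows. The function name, the `max_steps` guard and the test integrand are assumptions introduced for the illustration; the spacing of .10 and the $10^{-4}$ threshold are the values given in the text.

```python
from math import exp, pi, sqrt

def truncated_simpson(g, theta_hat, step=0.10, rel_tol=1e-4, max_steps=1000):
    """Simpson's rule on a grid grown outward from theta_hat until the
    integrand falls below rel_tol times its value at theta_hat."""
    peak = g(theta_hat)

    def walk(direction):
        values = []
        for i in range(1, max_steps + 1):
            v = g(theta_hat + direction * i * step)
            values.append(v)
            if v < rel_tol * peak:
                break
        return values

    left, right = walk(-1), walk(+1)
    values = left[::-1] + [peak] + right
    if len(values) % 2 == 0:
        # Simpson's rule needs an odd number of points: add one on the right.
        values.append(g(theta_hat + (len(right) + 1) * step))
    total = values[0] + values[-1]
    for j in range(1, len(values) - 1):
        total += (4 if j % 2 else 2) * values[j]
    return total * step / 3.0, len(values)

# Hypothetical sharply peaked integrand (a normal density with mean .3, sd .5).
g = lambda t: exp(-0.5 * ((t - 0.3) / 0.5) ** 2) / (0.5 * sqrt(2.0 * pi))
integral, n_points = truncated_simpson(g, 0.3)
```

In this example the grid stops well short of a fixed interval such as [-2.05,2.05] on one side and the integrand is never evaluated where it is negligible.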


The  recursive  algorithm  described  in  Section  Six  is  used  to  calculate 
the  likelihood  function  for  aberrant  examinees.  The  algorithm  is  first 
summarized  with  no  more  generality  than  is  needed  for  the  spurious  high  and 
low  studies.  The  remainder  of  this  section  discusses  refinements  of  the 
basic  algorithm  for  spurious  high  and  low  studies. 

Recall that the likelihood function for spurious high aberrance is

$$
P_{\mathrm{Aberrant}}(u = u^* \mid \theta = t)
= \sum_{S} \mathrm{Prob}\{\text{set } S \text{ is sampled for tampering}\} \times
\prod_{i=1}^{n} \mathrm{Prob}\{u_i = u_i^* \mid \theta = t \text{ and } S \text{ is sampled}\}
$$

$$
= \binom{n}{k}^{-1} \sum_{S} \; \prod_{i \in S} u_i^*
\prod_{i \notin S} P_i(t)^{u_i^*} Q_i(t)^{1-u_i^*} .
$$

Now if $S$ contains one or more of the incorrectly answered items,
$\prod_{i \in S} u_i^* = 0$. Consequently the summation can be taken over all
$k$ element subsets of the $n_c$ correctly answered items rather than of the
$n$ items, and the second product will be divisible by
$W(t) = \prod_{i: u_i^* = 0} Q_i(t)$. Thus

$$
P_{\mathrm{Aberrant}}(u = u^* \mid \theta = t)
= \binom{n}{k}^{-1} W(t) \sum_{S'} \; \prod_{i:\, i \notin S',\; u_i^* = 1} P_i(t) ,
$$

where the summation is over the $\binom{n_c}{k}$ $k$-element subsets $S'$ of
the set of correct items in pattern $u^*$. In other words, the summation is
the $(n_c - k)$th symmetric function on the vector of $n_c$ not necessarily
distinct variables
$\langle P_{i_1}(t), P_{i_2}(t), \ldots, P_{i_{n_c}}(t) \rangle$ where
$i_j < i_{j+1}$ and $u_{i_j}^* = 1$. To evaluate the summation we use the
well-known recursion, given a probabilistic interpretation in Section Six,

$$
T(i, j+1) = T(i, j) + x_{j+1}\, T(i-1, j) .
$$

Here $T(i,j)$ is the $i$th elementary symmetric function on the first $j$
variables in a vector or sequence of numbers $\langle x_1, x_2, \ldots \rangle$,
i.e. the sum of the $\binom{j}{i}$ products having $i$ factors selected from
the first $j$ numbers.
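A minimal sketch of the spurious high computation, restricted to the correctly answered items as described above, is given below. The function names and the example values are hypothetical, and `P[i]` is assumed to hold $P_i(t)$ at a fixed ability.

```python
from math import comb, prod

def elementary_symmetric(xs, k_max):
    """Elementary symmetric functions of degree 0..k_max in xs, computed by
    adding one variable at a time."""
    T = [1.0] + [0.0] * k_max
    for x in xs:
        for i in range(k_max, 0, -1):
            T[i] += x * T[i - 1]
    return T

def spurious_high_likelihood(u, P, k):
    """Likelihood of u when k items, sampled uniformly without replacement,
    are rescored as correct and the rest follow P."""
    n = len(u)
    correct = [P_i for u_i, P_i in zip(u, P) if u_i == 1]
    n_c = len(correct)
    if k > n_c:
        return 0.0  # every k-item subset would contain an incorrect item
    W = prod(1.0 - P_i for u_i, P_i in zip(u, P) if u_i == 0)
    degree = n_c - k
    return W * elementary_symmetric(correct, degree)[degree] / comb(n, k)

# Hypothetical five-item example at one ability level.
P = [0.9, 0.7, 0.6, 0.8, 0.5]
value = spurious_high_likelihood([1, 0, 1, 1, 0], P, k=2)
```

Patterns with fewer than $k$ correct responses receive likelihood zero, since they cannot arise under this model.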

For spurious low aberrance the likelihood function is

$$
P_{\mathrm{Aberrant}}(u = u^* \mid \theta = t)
= \sum_{S} \mathrm{Prob}\{\text{set } S \text{ is sampled for tampering}\} \times
\prod_{i=1}^{n} \mathrm{Prob}\{u_i = u_i^* \mid \theta = t \text{ and } S \text{ is sampled}\}
$$

$$
= \binom{n}{k}^{-1} \sum_{S} \; \prod_{i \in S} p^{u_i^*} (1-p)^{1-u_i^*}
\prod_{i \notin S} P_i(t)^{u_i^*} Q_i(t)^{1-u_i^*} ,
$$

where the summation is over $k$ element subsets $S$ of the $n$ items and
where $p = .2$ is the probability of being correct when responding randomly on
a 5 option multiple choice test. To expeditiously calculate the likelihood
for a pattern $u^*$ with $n_c$ correct and $n_w = n - n_c$ wrong we rewrite
this as

$$
\binom{n}{k}^{-1} p^{n_c} (1-p)^{n_w} \sum_{S} \; \prod_{i \notin S}
[P_i(t)/p]^{u_i^*} [Q_i(t)/(1-p)]^{1-u_i^*}
$$

and evaluate
$\sum_{S} \prod_{i \notin S} [P_i(t)/p]^{u_i^*} [Q_i(t)/(1-p)]^{1-u_i^*}$
as the $(n-k)$th symmetric function on the vector

$$
\langle [P_1(t)/p]^{u_1^*} [Q_1(t)/(1-p)]^{1-u_1^*} , \; \ldots , \;
[P_n(t)/p]^{u_n^*} [Q_n(t)/(1-p)]^{1-u_n^*} \rangle .
$$

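A considerable further reduction in computation can be obtained by
using the fact that the $(m-k)$th symmetric function in
$\langle x_1, \ldots, x_m \rangle$ equals $\prod_{i=1}^{m} x_i$ times the
$k$th symmetric function in $\langle x_1^{-1}, \ldots, x_m^{-1} \rangle$.
Writing $q = 1 - p$, this gives

$$
\sum_{S} \prod_{i \notin S} [P_i(t)/p]^{u_i^*} [Q_i(t)/q]^{1-u_i^*}
= \prod_{i=1}^{n} [P_i(t)/p]^{u_i^*} [Q_i(t)/q]^{1-u_i^*}
\; \sum_{S} \prod_{i \in S} [p/P_i(t)]^{u_i^*} [q/Q_i(t)]^{1-u_i^*}
$$

$$
= p^{-n_c} q^{-n_w} \, \ell(u^*,t) \times \text{the } k\text{th symmetric function in }
\langle [p/P_1(t)]^{u_1^*} [q/Q_1(t)]^{1-u_1^*} , \; \ldots , \;
[p/P_n(t)]^{u_n^*} [q/Q_n(t)]^{1-u_n^*} \rangle ,
$$

where $\ell(u^*,t)$ is the likelihood of $u^*$ for normal examinees with
ability $t$.

The same identity gives a reduction in the amount of calculation for
spurious high analyses for patterns $u^*$ with $k < n_c - k$.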
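The identity relating the $(m-k)$th and $k$th symmetric functions is easy to check numerically. The sketch below uses hypothetical function names and an arbitrary vector of positive numbers; it is a check of the identity only, not a reproduction of the authors' program.

```python
from math import prod

def elementary_symmetric(xs, k_max):
    """Elementary symmetric functions of degree 0..k_max in xs."""
    T = [1.0] + [0.0] * k_max
    for x in xs:
        for i in range(k_max, 0, -1):
            T[i] += x * T[i - 1]
    return T

def high_degree_via_reciprocals(xs, k):
    """(m - k)th symmetric function of xs, obtained as the product of the xs
    times the kth symmetric function of their reciprocals."""
    return prod(xs) * elementary_symmetric([1.0 / x for x in xs], k)[k]

xs = [0.5, 2.0, 4.0, 0.25, 8.0]
k = 1
direct = elementary_symmetric(xs, len(xs) - k)[len(xs) - k]
shortcut = high_degree_via_reciprocals(xs, k)
```

The shortcut keeps an array of only $k + 1$ entries instead of $m - k + 1$, which is where the saving comes from when $k$ is small relative to $m$.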

8.  An Illustrative Example

To  illustrate  the  algorithm  described  in  Section  Seven,  item  parameter 
estimates  obtained  from  Levine  and  Drasgow's  (1983)  fitting  of  the  three 
parameter  logistic  model  to  the  85  item  April,  1975  Scholastic  Aptitude 
Test-Verbal  section  (SAT-V)  were  taken  as  simulation  parameters.  One 
thousand  normal  response  vectors  were  created  by  sampling  abilities  from  a 
normal  (0,1)  distribution  truncated  to  the  [-2.05,2.05]  interval, 
computing  the  logistic  probabilities  of  correct  responses,  and  then  scoring 
each  simulated  item  response  as  correct  or  incorrect  depending  upon  whether 
a  number  sampled  from  a  uniform  [0,1]  distribution  was  less  than  or 
greater  than  the  logistic  probability.  A  sample  of  500  spuriously  low 
response  patterns  was  created  by  first  generating  500  normal  response 
patterns.  Then  nine  simulated  items  were  randomly  selected  without 
replacement  from  each  response  pattern  and  each  item  was  rescored  to  be 
correct  with  probability  .2  and  rescored  to  be  incorrect  with  probability 
.8  .  The  likelihood  ratio  statistic  was  computed  for  all  1500  patterns,  as 
described  in  Section  Seven. 
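The data generation steps described above can be sketched as follows. The item parameters in this sketch are invented placeholders, not the SAT-V estimates, and the function names and the random seed are assumptions; the sample sizes, the truncation interval, the nine tampered items and the rescoring probability of .2 follow the text.

```python
import random
from math import exp

def three_pl(t, a, b, c):
    return c + (1.0 - c) / (1.0 + exp(-a * (t - b)))

def sample_ability(rng, bound=2.05):
    """Normal (0,1) ability truncated to [-bound, bound] by rejection."""
    while True:
        t = rng.gauss(0.0, 1.0)
        if abs(t) <= bound:
            return t

def simulate_normal_pattern(rng, items, t):
    """Score each item correct when a uniform [0,1] draw falls below P_i(t)."""
    return [1 if rng.random() < three_pl(t, a, b, c) else 0 for a, b, c in items]

def make_spuriously_low(rng, pattern, k=9, p=0.2):
    """Rescore k items, sampled without replacement, as correct with
    probability p and incorrect otherwise."""
    tampered = list(pattern)
    for i in rng.sample(range(len(pattern)), k):
        tampered[i] = 1 if rng.random() < p else 0
    return tampered

rng = random.Random(1975)
# Placeholder (a, b, c) parameters for an 85 item test.
items = [(1.0, -2.0 + 4.0 * i / 84, 0.2) for i in range(85)]
normal_sample = [simulate_normal_pattern(rng, items, sample_ability(rng))
                 for _ in range(1000)]
low_sample = [make_spuriously_low(rng, simulate_normal_pattern(rng, items, sample_ability(rng)))
              for _ in range(500)]
```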

Table One presents the proportions of spuriously low response patterns
correctly classified as aberrant when various proportions of normal response
patterns are misclassified as aberrant. The table also presents the results
for the standardized $\ell_0$ index studied by Drasgow, Levine, and Williams
(1985). It is evident that the envelope curve statistic provides a
substantial improvement over the standardized index. This finding is
important because in previous research (Drasgow, 1982; Levine & Drasgow,
1982; Levine & Rubin, 1979) we have been unable to find an index that
provides detection rates that are clearly superior to $\ell_0$.


Proportion of Normal Response        Proportion Detected by
Patterns Classified              Envelope Curve      Standardized
As Aberrant                      Statistic           $\ell_0$

Footnotes 


This work was supported by United States Office of Naval Research
contracts N00014-79C-0752, NR 154-445 and N00014-83K-0397, NR 150-513,
Michael V. Levine, Principal Investigator.

$\langle t, R_\lambda(t) \rangle$ is on the boundary of a convex polygon
because the range of the likelihood ratio statistic $\lambda$ is finite.
Therefore $\langle t, R_\lambda(t) \rangle$ is a vertex (and a nonrandomized
test is optimal) or $\langle t, R_\lambda(t) \rangle$ is on a line segment
connecting two vertices (and a randomized test obtained from the two tests
associated with the segment is optimal).